Professional team sports provide an excellent domain for studying the dynamics of social
competitions. These games are constructed with simple, well-defined rules and payoffs that admit a
high-dimensional set of possible actions and nontrivial scoring dynamics. The resulting gameplay
and efforts to predict its evolution are of great interest to both sports professionals and
enthusiasts. In this paper, we consider two online prediction problems for team sports: given a
partially observed game, Who will score next? and ultimately Who will win? We present novel
interpretable generative models of within-game scoring that allow for dependence on lead size
(restoration) and on the last team to score (anti-persistence). We then apply these models to
comprehensive within-game scoring data for four sports leagues over a ten-year period. By assessing
these models’ relative goodness-of-fit, we shed new light on the underlying mechanisms driving the
observed scoring dynamics of each sport. Furthermore, in both predictive tasks, our models
consistently outperform baseline models, and they provide quantitative assessments of latent team
skill over time.


I. INTRODUCTION 

Competition in social systems is a natural and pervasive
mechanism for improving performance and distributing
limited resources. The quantitative study of such
competitions can improve our ability to predict the outcomes
associated with specific strategies and the strategic
choices that competitors may make. However, most real
competitions take place in complex and evolving environments [?],
which makes them difficult to study. Professional
team sports, with their well-defined and consistently
enforced rules, provide a controlled setting for the
study of competition dynamics [?] and have previously
been used as model systems for studying business
decision making and human behavioral biases [?].
The recent trend toward recording comprehensive and
detailed data on the events within particular games provides
us with new opportunities to study, model, and
predict the dynamics of these games. The results of these
studies promise to shed new light on a wide variety of existing
competitive social systems, and enhance work on
designing new ones, both offline and online.

Here, we examine the time series of scoring events in 
all league games across four different team sports over a 
period of ten years. We construct and test probabilistic 
models for two online predictive tasks: given a partially 
observed game Who will score next? and ultimately Who 
will win? We then use these models to investigate the 
predictiveness of the dynamical phenomena of restoration 
and anti-persistence, which are defined below.

The events within a particular game can be effectively 
modeled as the interaction of skill and chance. Inferring 
skill from a series of competitions has a long history of 
study, both for individuals [?] and for teams [?].
However, this past work has typically only considered the 
final outcome of games, in terms of either a win or loss, 
or the final point difference. Here, we focus on modeling 
the specific pattern of scoring events within an individual 
game. 

The role of chance also has a long history of study, 
typically focusing on the question of whether one success 
increases the likelihood of subsequent success. This idea 
can be formalized at different levels, e.g., success by individual
players within a game [?], or a team’s
success across multiple games [?]. Here, for the
first time, we focus on a different level: success by a whole 
team within a game. 

A simple starting point for such models is the basic
idea of many skill ranking systems [?], which model
game outcomes as random variables dependent on the
competing teams’ skills. We extend this idea to consider
the point-scoring events within a game to be a sequence of
independent contests. Past work supports this approach,
as some studies have found a lack of dependence between
an individual scoring and their ability to score subsequent
points [?], or between a team winning and
their chance to win future games [?]. On the other
hand, there is also evidence of non-independence, e.g., the
probability of scoring itself can vary with the clock time
within a game or with the size of the lead [?]. To
investigate the degree to which non-independence governs
scoring probabilities, we construct a sequence of
more complex models that allow specific aspects of a
game’s current state to influence scoring rates, e.g., the
team that scored last and the lead size.

In many sports, including American football and basketball,
a simple source of non-independence is a forced
change in ball possession after each scoring event, putting
the scoring team at a disadvantage. This can result in a
phenomenon called anti-persistence, in which a score by
one team is more likely to be followed by a score by their
opponent [?].

Another potential source of non-independence is the 
size of the lead itself. Past work has shown that the 
observed probability of scoring next can vary with lead 
size [?]. A negative dependence may be the result of
strategy, e.g., a team using its best players when it falls
behind and substituting them out when it is ahead.
Such strategies have a restorative effect on the lead size, 
serving to pull the size of the lead back toward zero. 
Conversely, anti-restoration or momentum occurs when 
the leading team has a higher chance of scoring again, 
perhaps by improving their control over the playing field 
or by learning from gameplay how to better exploit the 
weaknesses of the opposing team. 

In this paper, we develop probabilistic generative models
around these ideas to explore and predict the evolution
of point scoring over the course of a game. We use
these models to deduce the impact of chance, strategy,
and the rules of the game itself, and to test two simple
hypotheses:

1. the probability of scoring does not depend on the
current state of the game (team skill alone matters).

2. the probability of scoring does depend on the current
game state (as well as team skill).

Our probabilistic models encode specific instances of 
these assumptions and we assess their accuracy under 
two online predictive tasks. We present novel predictive
models that not only predict the outcome of a game,
but also predict the sequence of scoring events better
than baseline models.

II. RELATED WORK 

Our work addresses two novel prediction problems,
Who will score next? and Who will win?, using
only the sequence of scoring events that have already
occurred during the game. In the following, we outline
work related to each of these questions in turn.

Essential to answering the question Who will score
next? is understanding the underlying mechanisms of
scoring dynamics. The study of competitive team sports
has a rich history spanning a broad selection of features
including the timing of scoring events [?],
long-range correlations in scoring [30], the
role of timeouts [31], the identification of safe leads [8],
and the impact of spatial positioning and playing field
design [?]. The most relevant of these studies
focus on the analysis of individual player “momentum”
or “hot-hands” [?] and on team winning
streaks [?]. Here, we bring together these
two ideas by considering the notion of momentum, or its
reverse “restoration”, at the team level. Although some
analysis has previously been undertaken in this direction [?],
we go further to provide the first predictive
models that answer the question: Who will score next?


The foundations of our approach lie in the field of skill
modeling and team ranking [?], which originated in
the mid-20th century. Work in this area includes the
ranking of individuals [?], teams [?],
or both [?]. These models have been applied to a
wide range of competitive events, including baseball [32],
chess [?], American football [?], association
football (soccer) [33], and tennis [?]. More recently, they
have been adapted to matchmaking problems in online
games [?] and to calibrating reviewer scores in computer
science conferences [?].

Our work is the first to use skill ranking models to predict
Who will win? by predicting the sequence of scoring
events within a game. Skill ranking models have previously
been applied to predicting game outcomes but only
based on the final outcome of the game, either in terms
of the win/loss result or the final point difference. These
past approaches thus cannot update their prediction as
the game unfolds, while our models can. We train on a
history of scoring event sequences so that we may predict
Who will win? in an online fashion. Some commercial
online sports betting systems exist that make similar
online predictions, but these systems are proprietary and
closed, which precludes a scientific evaluation or comparison
with our models. They are not considered hereafter.


III. SPORTS DATASETS 

We use scoring event data^1 from four team sports:
college-level American football (CFB, 10 seasons; 2000-2009),
professional American football (NFL, 10 seasons;
2000-2009), hockey (NHL, 9 seasons; 2000-2003,
2005-2009),^2 and basketball (NBA, 9 seasons; 2002-2010). Each
dataset consists of the set of scoring events for each game
played in the season. Each event record includes the time at
which the points were scored, the team and player that scored,
and the point value. Table I gives a summary of these data including the
number of teams, games, and individual scoring events.
In our analysis and modeling, we discard the timestamps
of the events and instead consider only the order in which
events appear within a game.


A. Preprocessing 

We extract from the raw event data two sequences to
represent each game: a point sequence $\phi$, where $\phi_i$ is the
point value of scoring event $i$ in the game, and a team
sequence $\psi$, where $\psi_i \in \{r, b\}$ is the identity of the team
that won those points. If there are $N_t$ events in game


^1 Data provided by STATS LLC, copyright 2015.
^2 The entire 2004 NHL season was canceled due to an extensive
lockout over a dispute about player salary caps [?].
TABLE I. Summary of our sports data for multiple seasons across four team competitive sports.

sport               abbrv.  seasons        teams  games    games           scoring events  scoring events  mean events
                                                  (total)  (preprocessed)  (total)         (preprocessed)  (preprocessed)
Football (college)  CFB     10, 2000-2009  461    14,588   13,689          190,337         117,752         8.60
Football (pro)      NFL     10, 2000-2009  32     2,645    2,561           32,800          20,115          7.85
Basketball (pro)    NBA     9, 2002-2010   30     11,744   11,744          1,301,408       1,096,179       93.34
Hockey (pro)        NHL     9, 2000-2009   30     11,813   10,259          65,085          59,227          5.77


$t$, then the corresponding $\phi$ and $\psi$ each contain $N_t$ elements,
and the lead size at event $i$ is

$$L_i = \sum_{j=1}^{i} \phi_j \left[ \delta(\psi_j, r) - \delta(\psi_j, b) \right] , \tag{1}$$

for team labels r and b (arbitrarily chosen), where $\delta(\cdot,\cdot)$
is the Kronecker delta function and by convention we
compute $L$ from team r’s perspective.
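
As a concrete illustration of the lead size defined above, the following minimal Python sketch accumulates signed point values along a single game. The function name `running_lead` and the list-based representation of the point and team sequences are assumptions made for illustration; they are not taken from the original analysis.

```python
def running_lead(points, teams):
    """Return the lead size after each scoring event, from team r's perspective.

    points -- point values of the scoring events, in order (the point sequence)
    teams  -- 'r' or 'b' for each event, in order (the team sequence)
    """
    if len(points) != len(teams):
        raise ValueError("point and team sequences must have equal length")
    leads = []
    lead = 0
    for value, team in zip(points, teams):
        # Points won by r increase the lead; points won by b decrease it.
        lead += value if team == "r" else -value
        leads.append(lead)
    return leads


# Hypothetical game: r scores 3, b scores 7, r scores 2.
example_leads = running_lead([3, 7, 2], ["r", "b", "r"])
print(example_leads)  # [3, -4, -2]
```

The lead from team b’s perspective is simply the negative of each value.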

We begin by removing some games and scoring events.
We remove any events that occur during regulation overtime
(0.88% of all events), because these events follow
different scoring processes than events in regular game
time [?]. Additionally, any games in which only one team
scored are removed (6.24% of all games), as the raw data
do not indicate the identity of the non-scoring team.

Under certain game conditions, multiple scoring 
events, potentially by different teams, can occur at the 
same game clock time. For example, in American football,
the clock is stopped after a touchdown is scored but
the scoring team gets a chance to score a conversion. If 
the conversion is unsuccessful, occasionally the opposing 
team gains control and scores points before the clock is 
restarted. Similarly, in basketball, the clock is stopped 
during free throws after a foul, after which the ball is 
inbounded (thrown in). If the ball is inbounded close to 
the other basket, it is possible to score before a second 
has elapsed on the clock. In these cases, the ordering of 
these events is ambiguous. 

Removing these events would alter the running lead
size, which is one of the game states of interest. Instead,
we merge simultaneous events into a single scoring play
that removes the ordering ambiguity while preserving the
correct score. If one team scores two simultaneous events
$i$ and $i+1$, we merge their values, setting $\phi_i = \phi_i + \phi_{i+1}$,
and removing event $i+1$ from both sequences. If two
teams score simultaneously, we merge their values with
that of the immediately preceding event in a way that
preserves the running lead. Specifically, we set
$\phi_{i-1} = \phi_{i-1} \pm |\phi_i - \phi_{i+1}|$, where the sign is consistent with the
previous assignment of r and b labels to teams, and then
remove events $i$ and $i+1$ from both sequences.


B. Scoring and lead size 

We use these point and team sequences to make an
initial investigation of our hypotheses. If the scoring dynamics
are truly independent of the game’s state, these
dynamics will be indistinguishable from an independent
Bernoulli process, in which each Bernoulli trial represents
a scoring event. We evaluate this model by calculating
the empirical probability that a team will score the next
event as a function of the current lead size $L$. Recall that
we compute $L$ from the perspective of team r; thus, if r
is leading, then $L$ is positive, while if r is trailing, then
$L$ is negative (and vice versa for b). This function is thus
rotationally symmetric about a lead of $L = 0$, where
neither team leads, and has the mathematical form of
$P(\psi_i = r \mid L_{i-1}) = 1 - P(\psi_i = b \mid -L_{i-1})$.

[Figure 1: panels CFB and NBA; horizontal axis: lead size]

FIG. 1. Probability that a team scores next as a function of
its lead size, for the observed (yellow) and simulated (black)
patterns, each with a linear least squares fit line. The simulated
scoring sequence assumes that the probability of scoring
is independent of the game’s state.

We compare the empirical scoring function to one calculated
from synthetic team sequences generated according
to an independent Bernoulli process, in which we flip
a biased coin to determine which team wins each scoring
event. The coin’s bias is determined by the proportion of
scoring events each team wins in that particular game,
$\frac{1}{N_t} \sum_{i=1}^{N_t} \delta(\psi_i, r)$ (or the analogous quantity for b). In this simulation, events are
thus independent of the game state (hypothesis 1). We
also compute a least-squares regression line for the empirical
and for the synthetic data, in which each point
is given weight proportional to the number of times the
corresponding lead size was observed.
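
The comparison just described can be sketched in a few lines of Python. This is an illustrative sketch under simplifying assumptions, not the original analysis code: every scoring event is worth one point, game lengths are fixed at 50 events, and the per-game biases are drawn uniformly from [0.3, 0.7] rather than taken from observed games. All function names are hypothetical.

```python
import random
from collections import defaultdict


def simulate_game(n_events, p_r, rng):
    """Team sequence from an independent Bernoulli process with bias p_r."""
    return ["r" if rng.random() < p_r else "b" for _ in range(n_events)]


def scoring_function(games):
    """Estimate P(team scores next | its current lead) over many games.

    Each event is counted once from each team's perspective, which makes
    the estimate rotationally symmetric about a lead of zero.
    """
    wins = defaultdict(int)
    counts = defaultdict(int)
    for seq in games:
        lead = 0  # lead from r's perspective (unit point values assumed)
        for team in seq:
            counts[lead] += 1
            wins[lead] += team == "r"
            counts[-lead] += 1
            wins[-lead] += team == "b"
            lead += 1 if team == "r" else -1
    prob = {lead: wins[lead] / counts[lead] for lead in counts}
    return prob, dict(counts)


def weighted_slope(prob, counts):
    """Least-squares slope, weighting each lead size by its observation count."""
    total = sum(counts.values())
    mean_x = sum(counts[k] * k for k in prob) / total
    mean_y = sum(counts[k] * prob[k] for k in prob) / total
    sxx = sum(counts[k] * (k - mean_x) ** 2 for k in prob)
    sxy = sum(counts[k] * (k - mean_x) * (prob[k] - mean_y) for k in prob)
    return sxy / sxx


rng = random.Random(0)
games = [simulate_game(50, rng.uniform(0.3, 0.7), rng) for _ in range(2000)]
prob, counts = scoring_function(games)
slope = weighted_slope(prob, counts)
print(f"weighted slope of simulated scoring function: {slope:.4f}")
```

Because stronger teams both build leads and keep scoring, heterogeneous biases alone are enough to yield a positive slope in such a simulation, even though events within each game are independent.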

All of the resulting gradients relating scoring probability
to lead size are nonzero (Fig. 1), and each Bernoulli
process produces a positive gradient. This pattern simply
reflects the empirical distribution of biases used to simulate
the ensemble of games, with a more positive slope
reflecting broader variance in these biases. The variance
in the estimated scoring probability increases with lead
size simply because progressively fewer games produce
leads of that magnitude.

Comparing the observed and simulated scoring functions
(Fig. 1), we observe a clear contradiction. The gradient
and, in particular for NBA, the range of lead sizes
generated by the Bernoulli process disagree strongly with
those properties observed in the empirical data. These results
suggest that the probability of scoring does indeed
depend, somehow, on the game state (hypothesis 2). In
subsequent sections, we investigate this dependence using
sophisticated probabilistic models to determine how
the probability of scoring depends on game state.


IV. WHERE STANDARD TESTS FAIL 

To determine whether scoring events are independent, 
we now apply a suite of statistical randomization tests, 
which compare observed sequences to random sequences 
with similar properties. Specifically, we employ the 

• serial test (non-uniformity), 

• Wald-Wolfowitz runs test (anti-restoration), and 

• autocorrelation test (persistence/anti-persistence), 

where for each the null hypothesis is that the team sequence
$\psi$ is simply a random sequence.

The serial test [25] examines bigram frequencies in a
sequence and compares them to their expected frequencies
under a uniformly random sequence. For a team sequence
with $N$ elements, the observed fractions of bigrams
{rr, rb, br, bb} are compared to their expectations
of $N/4$. This test can identify the existence of a bias
within each game, i.e., if one team is systematically more
likely to score than another.

The Wald-Wolfowitz runs test [38] examines the observed
number of runs in a sequence, i.e., substrings of
$\psi$ for which each element is the same (either r or b), which
allows us to identify either positive momentum or anti-restorative
effects in within-game scoring. We reject the
null hypothesis that $\psi$ is random if the observed number
of runs is significantly below its expected value. Previously,
this test has been used to detect winning streaks
in sequences of games [37].

The autocorrelation test measures the correlation of a
sequence with itself, shifted by one element, which allows
us to identify periodic dynamics that occur as a result of
anti-persistence. Here, we reject the null hypothesis that
$\psi$ contains no dependence between values if the autocorrelation
is significantly higher or lower than zero.
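
To make two of these statistics concrete, the sketch below computes the Wald-Wolfowitz runs statistic and the lag-1 autocorrelation for a single team sequence. It uses the textbook large-sample normal approximation for the runs test and encodes r as +1 and b as -1; both are assumptions of this sketch, and the original analysis may have computed its tests differently. The function names are hypothetical.

```python
import math


def runs_test_z(seq):
    """Wald-Wolfowitz z-score: negative means fewer runs than expected."""
    n1 = seq.count("r")
    n2 = seq.count("b")
    n = n1 + n2
    if n1 == 0 or n2 == 0:
        return None  # the test is undefined when only one team scored
    runs = 1 + sum(1 for a, b in zip(seq, seq[1:]) if a != b)
    mean = 2.0 * n1 * n2 / n + 1.0
    var = 2.0 * n1 * n2 * (2.0 * n1 * n2 - n) / (n ** 2 * (n - 1))
    if var == 0:
        return None
    return (runs - mean) / math.sqrt(var)


def lag1_autocorrelation(seq):
    """Correlation of the sequence with itself shifted by one element."""
    if len(seq) < 2:
        return None
    x = [1.0 if s == "r" else -1.0 for s in seq]
    mean = sum(x) / len(x)
    denom = sum((v - mean) ** 2 for v in x)
    if denom == 0:
        return None
    num = sum((a - mean) * (b - mean) for a, b in zip(x, x[1:]))
    return num / denom


alternating = list("rb" * 10)           # strongly anti-persistent
streaky = list("r" * 10 + "b" * 10)     # two long runs
for name, seq in [("alternating", alternating), ("streaky", streaky)]:
    print(name, runs_test_z(seq), lag1_autocorrelation(seq))
```

A perfectly alternating sequence yields a large positive runs z-score and a strongly negative autocorrelation, while a streaky sequence yields the opposite signs.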

We apply each of these three tests to each of our four
data sets, and compare the results against a false positive
rate of $\alpha = 0.05$ (Fig. 2). We also consider each season
separately so as to reveal non-stationarities. Basketball,
unlike the other sports, produces a large proportion of
rejections for the serial and autocorrelation tests, which
reflects the known anti-persistence pattern in basketball
scoring [14].

FIG. 2. (top) Probability distributions for the number of scoring
events in a game, and (bottom) the randomization test
results for each sport, by season, versus a false positive rate
of $\alpha = 0.05$ (dashed line). The team sequences of each game
are tested independently and we plot the proportion of games
that reject the null hypothesis that the sequences are random.
Because CFB, NFL and NHL typically have a small number of
events per game (upper panel), the null hypothesis is difficult
to reject.

On the other hand, for all sports except basketball,
each of these tests rejects the null hypothesis at close to
or below the chosen false positive rate, a finding consistent
with each of these sequences being random. However,
this interpretation is problematic. The serial test
makes the very strict assumption that each sequence is
drawn from a uniform random distribution, i.e., each is
generated by flipping a fair coin several times. A face-value
interpretation thus implies that all teams have an
equal chance of winning each game (a highly unlikely
situation), and it predicts that the scoring function from
Section III B should be independent of lead size, which
contradicts the observed pattern (Fig. 1).

In fact, however, there is no contradiction: the $\psi$ sequences
are simply too short (Fig. 2) for these tests to
reliably distinguish random from non-random sequences
when we assume they are generated independently, i.e.,
the tests have low statistical power. The one exception
is basketball, whose sequences typically contain 90 or so
events, while those for American football or hockey typically
contain fewer than 10.

In the following sections, we show how to circumvent
the low statistical power of these tests by exploiting the
fact that team sequences are not, in fact, independent
of each other. Instead, each season’s sequences are generated
by repeatedly selecting pairs from a finite and
fixed population of teams. This process induces substantial
correlations across games that we can capture by
modeling the latent skills of each team within a given
season.


V. SKILL-BASED SCORING DYNAMICS 

Toward this end, we develop a series of models of increasing
complexity based on specific underlying mechanisms
for sports scoring dynamics, including independence,
restoration, and anti-persistence. Each of these
models represents team skill as a latent variable. We assume
that team skill is fixed over the course of any particular
season [18], which reflects the relatively stable team
rosters and coaching staffs, and low injury rates in these
sports. Furthermore, modeling each season separately allows
us to run multiple tests for each sport (one for each
season) and allows our models to capture real changes
in team skill across seasons [18].

Each of our models generates a team sequence $\psi$ by
extending the popular Bradley-Terry (BT) model [6] to
generate individual scoring events within a game. Traditionally,
the BT model is used to estimate unobserved
(latent) team skills from the observed outcomes of many
games among pairs of teams. The probability that team
r wins in a match against team b is given by the skill of
r relative to b:

$$P(r \text{ wins against } b) = d_{rb} = \frac{\pi_r}{\pi_r + \pi_b} , \tag{2}$$

where $\pi_r \in [0,1]$ is the latent skill for team r.

A. Independent model 

When scoring events within a game are independent,
their generation is equivalent to a simple Bernoulli process
with a game-specific bias. This is equivalent to an
“independent model” that applies the game-level BT
model of Eq. (2) to each of the individual scoring events
within a game, yielding

$$P(\psi_i = r) = d_{rb} . \tag{3}$$

This represents our first model, which can capture variability
in a team sequence caused by differences in team
skill parameters, but not other sources of variability.
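
The independent model is simple enough to state directly in code. The following sketch evaluates the Bradley-Terry probability for a pair of teams and the log-likelihood of an observed team sequence when every event is an independent trial with that probability. The function names and the example skill values are hypothetical; the parameter estimation itself is not shown here.

```python
import math


def bt_probability(pi_r, pi_b):
    """Bradley-Terry probability that r beats b, given positive skills."""
    return pi_r / (pi_r + pi_b)


def independent_log_likelihood(teams, pi_r, pi_b):
    """Log-likelihood of a team sequence under the independent model.

    Every scoring event is won by r with the same probability d_rb,
    regardless of the lead or of which team scored last.
    """
    d_rb = bt_probability(pi_r, pi_b)
    n_r = sum(1 for t in teams if t == "r")
    n_b = len(teams) - n_r
    return n_r * math.log(d_rb) + n_b * math.log(1.0 - d_rb)


# Hypothetical skills: r is three times as strong as b.
d_example = bt_probability(0.6, 0.2)
ll_example = independent_log_likelihood(["r", "r", "b"], 0.6, 0.2)
print(d_example, ll_example)
```

Under this model only the counts of events won by each team matter, not their order, which is why it cannot capture restoration or anti-persistence.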

B. Restorative models 

Real scoring functions (Fig. 1) produce a range of gradients.
However, the independent model can only produce
positive slopes. To capture a wider variety of scoring
function shapes, and in particular a negative slope or
“restorative” pattern, we extend the independent model
by allowing each team’s skill to explicitly covary with
its lead. Such a relationship could arise for psychological
reasons, e.g., a winning team “loses steam” or a losing
team gains motivation [3], or for strategic reasons, e.g.,
substituting out or in the more skilled players while in
the lead in order to conserve their energy, avoid injury,
or create momentum [?].

Our restorative model augments the independent
model with an explicit per-team “restorative force” parameter
$\gamma_r$, which modifies team r’s strength in response
to the current lead size from its perspective $\ell_r$ and captures
the fact that different teams may have different
behaviors in response to how far ahead or behind they
are. When $\gamma_r < 0$, team r exhibits a restorative pattern,
with skill being proportional to $-\ell_r$. When $\gamma_r > 0$, team
r exhibits an anti-restorative or momentum pattern, with
skill being proportional to $\ell_r$.

The probability that team r scores against b is given by

$$P(\psi_i = r) = d_{rb} + \ell_{ir} c_{rb} , \tag{4}$$

where $\ell_{ir}$ is r’s lead size just before event $i$ and

$$c_{rb} = \gamma_r + \gamma_b . \tag{5}$$

A game as a whole exhibits a restorative pattern whenever
$c_{rb} < 0$. This occurs either when both teams exhibit
a restorative pattern themselves ($\gamma_r < 0$ and $\gamma_b < 0$) or
when one team’s restorative force is stronger than the
other team’s anti-restorative force ($\gamma_r < 0$, $\gamma_b > 0$, and
$|\gamma_r| > |\gamma_b|$).

The additional term in Eq. (4) relative to the independent
model means this model’s scoring function
is no longer bounded on the [0,1] interval. We correct
this behavior by using a sigmoid function of the form
$\sigma(x) = (1 + e^{-x})^{-1}$ to provide a smooth and continuous
approximation of the misspecified linear function.

To make this approximation, we change variables so
that a logistic curve most closely approximates the linear
equation, which occurs when we match the gradients at
the point of symmetry at $P(\psi_i = r) = 1/2$. Setting the
derivative $\sigma'$ equal to $c_{rb}$, we find

$$\sigma'(m_{rb} \ell_{ir} + v_{rb}) = \frac{m_{rb}\, e^{-(m_{rb} \ell_{ir} + v_{rb})}}{\left(1 + e^{-(m_{rb} \ell_{ir} + v_{rb})}\right)^2} = c_{rb} . \tag{6}$$

We then solve for when the logistic function equals 1/2,
yielding

$$\sigma(m_{rb} \ell_{ir} + v_{rb}) = \frac{1}{1 + e^{-(m_{rb} \ell_{ir} + v_{rb})}} = \frac{1}{2} . \tag{7}$$

Finally, in solving Eqs. (6) and (7) we obtain the following
transformation of variables:

$$v_{rb} = -4 \left(1/2 - d_{rb}\right) \tag{8}$$

$$m_{rb} = 4 c_{rb} , \tag{9}$$

where $m_{rb}$ and $v_{rb}$ are the variables used in the logistic
function such that $c_{rb}$ and $d_{rb}$ retain their linear interpretation
and are thus comparable to the skill variables in
the independent scoring model. Figure 3 shows examples
of two linear functions and their corresponding logistic
approximations.

FIG. 3. Two examples of linear functions matched to logistic
functions using the change of variables in Eqs. (8) and (9).
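
The change of variables can be checked numerically. The sketch below assumes the transformation $m_{rb} = 4 c_{rb}$ and $v_{rb} = -4(1/2 - d_{rb})$ stated above, and verifies that the logistic curve passes through 1/2 at the same lead size as the linear function and has slope $c_{rb}$ there. The function names and the example values of $d_{rb}$ and $c_{rb}$ are hypothetical.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function 1 / (1 + exp(-x))."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def restorative_probability(d_rb, c_rb, lead_r):
    """Logistic approximation to the linear form d_rb + lead_r * c_rb."""
    m_rb = 4.0 * c_rb
    v_rb = -4.0 * (0.5 - d_rb)
    return sigmoid(m_rb * lead_r + v_rb)


# Hypothetical game: r is slightly stronger, with a restorative pattern.
d_rb, c_rb = 0.6, -0.01
# Lead at which the linear function d_rb + lead * c_rb equals 1/2.
lead_half = (0.5 - d_rb) / c_rb

p_half = restorative_probability(d_rb, c_rb, lead_half)
h = 1e-4
numeric_slope = (
    restorative_probability(d_rb, c_rb, lead_half + h)
    - restorative_probability(d_rb, c_rb, lead_half - h)
) / (2 * h)
print(lead_half, p_half, numeric_slope)
```

Away from the point of symmetry the two curves diverge, which is the intended behavior: the logistic form stays within [0,1] for any lead size.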


C. Anti-persistence models 


In many sports, we observe an anti-persistent pattern
in the team sequences, in which the probability that
r scores next depends on which team scored last, i.e.,
$P(\psi_{i+1} = r \mid \psi_i)$. For example, for NBA team sequences,
the rate of rr and bb bigrams is only 0.35, indicating
strong anti-persistence. (The rates for CFB, NFL, and
NHL are 0.45, 0.44, and 0.49, respectively.) Such an anti-persistence
pattern can occur when teams have different
degrees of skill at defensive and offensive play, e.g., when
both teams have offenses that are relatively stronger than
the opposing team’s defense.

To capture these effects, we extend the independent
model so that each team has an offensive skill parameter
$\pi_r^{\mathrm{off}}$ and a defensive parameter $\pi_r^{\mathrm{def}}$. For sports like
American football and basketball, ball possession (offensive
play) typically alternates after a scoring event. We
model this game rule by applying a team’s defensive skill
immediately after it scores and its offensive skill after the
other team scores. Under this independent anti-persistent
model, the probability of scoring event $i$ is

$$P(\psi_i = r \mid \psi_{i-1}) = \begin{cases} \dfrac{\pi_r^{\mathrm{def}}}{\pi_r^{\mathrm{def}} + \pi_b^{\mathrm{off}}} & \text{if } \psi_{i-1} = r \\[2ex] \dfrac{\pi_r^{\mathrm{off}}}{\pi_r^{\mathrm{off}} + \pi_b^{\mathrm{def}}} & \text{if } \psi_{i-1} = b \end{cases} \tag{10}$$

Finally, we obtain a fourth model by combining the
restorative model with the anti-persistent model.
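
The possession rule described above can be illustrated with a short sketch. It assumes a Bradley-Terry ratio in which the team that just scored contributes its defensive skill and its opponent contributes its offensive skill, following the verbal description of the model; the function name, argument names, and skill values are hypothetical, and the treatment of the first event of a game (which has no previous scorer) is left unspecified.

```python
def anti_persistent_probability(last_scorer, off_r, def_r, off_b, def_b):
    """Probability that r wins the next scoring event, given the last scorer.

    The team that scored last is assumed to be on defense, so its defensive
    skill is matched against the other team's offensive skill.
    """
    if last_scorer == "r":
        return def_r / (def_r + off_b)
    if last_scorer == "b":
        return off_r / (off_r + def_b)
    raise ValueError("last_scorer must be 'r' or 'b'")


# Hypothetical, evenly matched teams whose offenses outclass the defenses.
skills = dict(off_r=0.6, def_r=0.3, off_b=0.6, def_b=0.3)
p_after_r = anti_persistent_probability("r", **skills)
p_after_b = anti_persistent_probability("b", **skills)
print(p_after_r, p_after_b)
```

With offenses stronger than defenses, the team that just scored is less likely to score again, which is the anti-persistent pattern; setting each team’s offensive and defensive skills equal recovers the independent model.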


VI. MODELING SCORING DYNAMICS 

We fit the (i) independent, (ii) restorative, (iii) independent
anti-persistent, and (iv) restorative anti-persistent
models to the team sequences within a given
season of each sport, using Markov chain Monte Carlo
to estimate each model’s parameters. For each, we assess
model goodness-of-fit by calculating the held-out likelihood
for each model under a 10-fold cross-validation.
Furthermore, we follow this procedure for each season of
each sport separately, the results of which are given in
Tables II-V. By treating seasons independently, we obtain
multiple model assessments within each sport while
controlling for within season variability. For each season,
we highlight the two highest scores in blue and the highest
score in bold.


In basketball (NBA), we find that the restorative anti-persistent
model consistently provides the best fit across
all seasons (Table II), with the second best model being
the independent anti-persistent model. These results
indicate a strong role for both restoration and anti-persistence
in driving basketball scoring dynamics. Previous
analysis of basketball scoring using random walk
theory came to similar conclusions [?].

American football (NFL and CFB) shows a different
result, with both types of independent model being heavily
favored over both types of restorative model (Tables
III and IV). The poor fit here of the restorative models
indicates that the competitive processes that produce a
restorative force in basketball are largely absent in American
football. This difference may be related to the much
greater scoring rate in basketball relative to American
football (Fig. 2): an increased scoring rate lowers the
marginal value of each scoring event relative to the game
outcome (who wins), and low value interactions in other
systems are associated with restorative forces [?].


Furthermore, the anti-persistent model for NFL is favored
in 8 of 10 seasons over the independent model,
while in CFB, it is favored in only 2 of 10 seasons. That
is, anti-persistence appears to play a stronger role in NFL
games than in CFB games. In fact, CFB is the only sport
to strongly favor the independent model, a result that
agrees with our previous simulation results (Fig. 1),
which showed that the trivial independent model produced
the smallest disagreement for CFB between real
and simulated scoring function gradients among the four
sports.

The results for hockey (NHL) are less clear cut (Table V).
In 8 out of 9 seasons, the independent anti-persistent
model is either the best or second best model,
and the independent model is best or second best in
7 out of 9. On the other hand, the simple restorative
model wins for 2 seasons, and is second best for one.
(The restorative anti-persistent model is a poor fit for
all hockey seasons.) We note, however, that the log-likelihoods
among these three models are all very close,
indicating that each performs about as well as the others
for these data. Given that NHL is also the one sport
among the four that is not anti-persistent by design (possession
is determined by a “faceoff” after each goal) and
that its scoring function has a negative gradient, we tentatively
conclude that the restorative model is better.

Across seasons, the best overall models appear to be 
CFB: independent; NFL: independent anti-persistent; 
NBA: restorative anti-persistent; and NHL: restorative. 
TABLE II. Log-likelihoods on held-out data for NBA games.

model                        2002    2003    2004    2005    2006    2007    2008    2009    2010
Independent                  -80849  -78814  -84698  -84744  -84795  -86070  -85727  -86314  -85114
Restorative                  -80573  -78506  -84361  -84404  -84469  -85777  -85444  -86005  -84704
Independent anti-persistent  -75655  -73823  -79151  -78841  -79088  -80174  -79841  -80513  -79386
Restorative anti-persistent  -75627  -73777  -79097  -78796  -79040  -80141  -79812  -80465  -79297


TABLE III. Log-likelihoods on held-out data for NFL games.

model                        2000   2001   2002   2003   2004   2005   2006   2007   2008   2009
Independent                  -1286  -1307  -1408  -1372  -1403  -1373  -1369  -1433  -1484  -1395
Restorative                  -1324  -1347  -1450  -1402  -1451  -1422  -1424  -1466  -1530  -1432
Independent anti-persistent  -1278  -1290  -1401  -1361  -1392  -1378  -1372  -1425  -1473  -1387
Restorative anti-persistent  -1322  -1337  -1450  -1496  -1448  -1427  -1434  -1470  -1520  -1426


TABLE IV. Log-likelihoods on held-out data for CFB games.

model                        2000   2001   2002   2003   2004   2005   2006   2007   2008   2009
Independent                  -7487  -7575  -8098  -8105  -7675  -7708  -7265  -8673  -8435  -8097
Restorative                  -8114  -8182  -8689  -8656  -8268  -8176  -7884  -9334  -9065  -8777
Independent anti-persistent  -7486  -7643  -8142  -8201  -7741  -7759  -7328  -8678  -8458  -8078
Restorative anti-persistent  -8011  -8113  -8625  -8586  -8198  -8110  -7781  -9214  -8880  -8630


TABLE V. Log-likelihoods on held-out data for NHL games.

Model                         2000   2001   2002   2003   2005   2006   2007   2008   2009
Independent                  -4432  -4238  -4300  -4078  -5026  -4712  -4504  -4755  -4655
Restorative                  -4432  -4238  -4313  -4056  -5031  -4695  -4511  -4761  -4663
Independent anti-persistent  -4420  -4237  -4287  -4068  -5020  -4706  -4497  -4761  -4668
Restorative anti-persistent  -4449  -4254  -4318  -4090  -5045  -4721  -4521  -4787  -4687
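
As an illustration of how held-out log-likelihoods of this kind can be
compared, the following minimal sketch picks the model with the highest
value in each season and overall, using the NBA values from Table II.
The helper names (`best_model_per_season`, `total_loglik`) are
hypothetical and are not taken from the paper; summing log-likelihoods
across seasons is one simple way to aggregate, assumed here for
illustration only.

```python
# Held-out log-likelihoods for NBA games, copied from Table II.
SEASONS = [2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010]
LOGLIK = {
    "independent": [-80849, -78814, -84698, -84744, -84795,
                    -86070, -85727, -86314, -85114],
    "restorative": [-80573, -78506, -84361, -84404, -84469,
                    -85777, -85444, -86005, -84704],
    "independent anti-persistent": [-75655, -73823, -79151, -78841, -79088,
                                    -80174, -79841, -80513, -79386],
    "restorative anti-persistent": [-75627, -73777, -79097, -78796, -79040,
                                    -80141, -79812, -80465, -79297],
}


def best_model_per_season(loglik, seasons):
    """Map each season to the model with the highest held-out log-likelihood."""
    best = {}
    for k, season in enumerate(seasons):
        best[season] = max(loglik, key=lambda name: loglik[name][k])
    return best


def total_loglik(loglik):
    """Sum each model's held-out log-likelihood over all seasons (an assumption)."""
    return {name: sum(values) for name, values in loglik.items()}


best = best_model_per_season(LOGLIK, SEASONS)
totals = total_loglik(LOGLIK)
overall = max(totals, key=totals.get)
print(best)
print(overall)
```

For the NBA values, the same model is preferred in every season, which
is consistent with the overall choice stated in the text.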


[Figure: panels CFB, NBA]

FIG. 4. Probability that a team scores next as a function of
its lead size, for the observed (yellow) and simulated (black)
patterns, each with a linear least squares fit line. Each
simulation uses the best overall skill model for that sport to
generate synthetic point and team sequences.


We check these models by performing a semi-parametric
bootstrap, generating synthetic $\phi$ and $\psi$ sequences of
the same number and lengths as observed empirically in each
season, and comparing the simulated and empirical scoring
functions. That is, we repeat the earlier scoring-function
assessment, but now using models that can capture dependence
across sequences. The results show that our skill-based models
are a dramatic improvement over simulating each game
independently (Fig. 4), agreeing closely with the empirical
scoring patterns in both the gradient and range of lead sizes.
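
The scoring function being compared here can be sketched as follows.
This is a minimal illustration rather than the authors' code: the
function names `scoring_function` and `fit_line`, the `(team, points)`
event encoding, and the unweighted least-squares fit over lead sizes are
all assumptions made for the example. The same routine would be applied
to both the observed and the simulated sequences.

```python
from collections import defaultdict


def scoring_function(games):
    """Estimate P(team scores next | team's current lead).

    games: list of games, each a list of (team, points) events with
    team in {"r", "b"}. Every event yields one observation per team,
    taken from that team's point of view.
    """
    scored = defaultdict(int)
    seen = defaultdict(int)
    for game in games:
        lead_r = 0  # lead of team r; team b's lead is -lead_r
        for team, points in game:
            for side, lead in (("r", lead_r), ("b", -lead_r)):
                seen[lead] += 1
                if team == side:
                    scored[lead] += 1
            lead_r += points if team == "r" else -points
    return {lead: scored[lead] / seen[lead] for lead in seen}


def fit_line(curve):
    """Ordinary least squares line through {x: y}; returns (slope, intercept)."""
    xs = list(curve.keys())
    ys = [curve[x] for x in xs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("need at least two distinct lead sizes")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Toy example: one short game in which the trailing team always scores.
toy_games = [[("r", 1), ("b", 1), ("r", 1)]]
curve = scoring_function(toy_games)
slope, intercept = fit_line(curve)
print(curve, slope, intercept)
```

A negative slope, as in the toy example, corresponds to a restorative
pattern in which the leading team is less likely to score next.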


VII. PREDICTING OUTCOMES 

We now apply our models to two online prediction tasks in each
of the sports: Who will score next? and Who will win? For both
tasks, we let our models observe the point and team sequences
of the first $T$ games in a particular season. We then use
these models to predict for each unobserved game in that season
(i) the team sequence values $\psi_i$ for $1 \le i \le N$, and
(ii) the identity of the winning team, when each model is
allowed to observe the game states $\{\psi_j, \phi_j\}$ for
$1 \le j < i$. In the second task, all models predict point
values $\phi_i$ as the mean value $\langle \phi \rangle$
averaged over all events in the season. We compare our
predictions to those of three baseline models.

The first baseline is a naive leading model, which assumes
that the team currently in the lead is the stronger team and
thus more likely to both score next and win the game.
Specifically, it predicts that the team holding the lead at
event $i$ will win the next event, i.e., it predicts
$\psi_{i+1} = r$ if $L > 0$ and $\psi_{i+1} = b$ if $L < 0$,
and will also win the game. If $L = 0$, the model flips a fair
coin for $r$ and $b$.

The second baseline is the standard Bradley-Terry model in
which we infer latent team skills $\pi$ from the win-loss
records among teams in the observed games. This model is
simpler than our independent model, which infers team skills
using team sequences $\{\psi\}$ of the observed games.

The third baseline is a simple first order Markov model. It
predicts that the next team to score will either be the same as
or different from the team that scored last, according to the
empirical bigram frequencies $\{rr, bb, rb, br\}$ observed in
the first $T$ games of the season. Formally, it predicts the
probability that a team will score again given it scored last
time as

$$
P(\psi_{i+1} = \psi_i) =
\left( \sum_{t=1}^{T} \sum_{i=1}^{N_t - 1}
\delta\!\left(\psi^{(t)}_{i}, \psi^{(t)}_{i+1}\right) \right)
\Bigg/
\left( \sum_{t=1}^{T} (N_t - 1) \right) ,
\tag{11}
$$

where $N_t$ is the number of scoring events in game $t$ and
$\delta(\cdot,\cdot)$ equals 1 when its two arguments match and
0 otherwise.

For both prediction tasks, we assess prediction accuracy via
the AUC statistic, which gives the probability that a randomly
selected positive instance is ranked above a randomly selected
negative instance. The AUC is a statistically principled
measure for binary classification tasks like ours where the
cost of an error is the same in either direction (since team
labels, $r$ or $b$, are arbitrary).

[Figure: x-axis: proportion of season observed; legend:
independent, restorative, independent anti-persistent,
restorative anti-persistent, Bradley-Terry, first order Markov,
leading; panel shown: NHL]

FIG. 5. Probability of accurately predicting which team will
score next (AUC), when models observe different fractions of
a season. Based on 95% confidence intervals, our best model
performs significantly better than the baseline models for
CFB and NBA, and after observing at least half of the season
for NFL and NHL.


A. Who will score next?

In the first task, we aim to predict which team will score
event $i$, for each $1 \le i \le N$, given the sequence of
preceding game states $(\phi_j, \psi_j)$ for $1 \le j < i$.
For this online prediction task, we learn each model’s
parameters from the first $T$ games in a season and then make
predictions across all unobserved games within that season.
We calculate the AUC for all predictions across all seasons to
obtain a single score. Each model observes at least 10% of a
season, which ensures that every team has played at least a
few times.
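
The train-then-predict procedure for this task can be illustrated with
the first order Markov baseline, which is simple enough to write out in
full. The sketch below is an assumption-laden illustration, not the
authors' implementation: the function names, the encoding of each game
as a string of `r`/`b` labels, the choice to skip the first event of
each game (which has no predecessor), and the quadratic-time AUC are all
choices made here for brevity.

```python
def fit_same_team_probability(train_games):
    """Fraction of consecutive scoring events won by the same team."""
    same = total = 0
    for seq in train_games:
        for prev, nxt in zip(seq, seq[1:]):
            same += prev == nxt
            total += 1
    if total == 0:
        raise ValueError("no consecutive scoring events in training data")
    return same / total


def markov_scores(seq, p_same):
    """Predicted P(next scorer is 'r') for every event after the first."""
    return [p_same if prev == "r" else 1.0 - p_same for prev in seq[:-1]]


def auc(scores, labels):
    """P(random positive is ranked above random negative); ties count 1/2."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical): each game is a sequence of scoring teams.
train_games = ["rbrb", "rrbr"]   # the "first T games"
test_games = ["rbrbr"]           # the unobserved games

p_same = fit_same_team_probability(train_games)
scores, labels = [], []
for seq in test_games:
    scores.extend(markov_scores(seq, p_same))
    labels.extend(team == "r" for team in seq[1:])
toy_auc = auc(scores, labels)
print(p_same, toy_auc)
```

Because the toy games alternate between teams, the fitted probability of
a repeat score is low and the baseline ranks every test event correctly.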

The results show that the overall best models identified 
in the previous section also tend to be the best predictors
of who will score next (Fig. 5), although some alternative
models also perform well. For instance, the best model for
NFL games early in the season is the first order Markov 
model; however, the best NFL model beats this baseline 
after about 30% of a season is observed. Similarly, the 
first order Markov model performs almost as well as the 
best skill model in predicting who will score next in NBA 
games, by capturing the known anti-persistence pattern 
in that sport. One of the worst models across all four 
sports is the leading baseline, which often performs only 
slightly better than chance. 


B. Who will win? 

Predicting who will win a game requires extrapolating the
point and team sequences to determine the game’s final
outcome. We simplify this task slightly by assuming that the
number of scoring events $N$ in each game is known. We then
allow the models to learn their parameters from the first 30%
of each season (other choices lead to results qualitatively
similar to those reported here). For each game in the
remainder of a season, the models predict the identity of the
winning team when they are allowed to observe a progressively
greater fraction of game states $(\phi_i, \psi_i)$ for
$0.1 \le i/N \le 0.9$, as if each model were watching the game
unfold in real time.
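
To make the extrapolation step concrete, the sketch below computes a win
probability in the simplest possible setting. It assumes, purely for
illustration, that team $r$ wins each remaining event independently with
a constant probability `p_r` and that every event is worth the season's
mean point value, so the final lead follows a shifted binomial
distribution. The function name and the even split of exact ties are
hypothetical choices; models with restoration or anti-persistence have
state-dependent probabilities and would require stepping through the
remaining events rather than this closed form.

```python
from math import comb


def win_probability(lead, remaining, p_r, mean_points=1.0):
    """P(team r wins) given r's current lead and the number of events left.

    Each remaining event is won by r with constant probability p_r and is
    worth mean_points. Exact ties are split evenly (an assumption).
    """
    prob = 0.0
    for k in range(remaining + 1):  # k = remaining events won by r
        final_lead = lead + mean_points * (2 * k - remaining)
        weight = comb(remaining, k) * p_r**k * (1 - p_r) ** (remaining - k)
        if final_lead > 0:
            prob += weight
        elif final_lead == 0:
            prob += 0.5 * weight
    return prob


# r leads by 4 points with 10 events left, slightly weaker than its opponent.
print(win_probability(lead=4, remaining=10, p_r=0.45, mean_points=2.0))
```

As the number of remaining events shrinks, the current lead dominates
the prediction and the choice of per-event probability matters less.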

The results show that the overall best model for each sport
both consistently outperforms the baselines and also correctly
predicts the winner with at least 80% accuracy at a game’s
halftime (Fig. 6).

The relatively poorer performance of the “leading” 
baseline model illustrates that this prediction task is 
non-trivial—who is leading at a given moment is not 
as predictive of who wins as knowing something about 
team skills and scoring dynamics. On the other hand, the 
Bradley-Terry baseline performs comparably well very 
early in the game, but is quickly beaten because it cannot 
learn from the real-time evolution of a game. 
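
For reference, Bradley-Terry skills of the kind used by this baseline
can be fit from win-loss counts with the standard minorization-
maximization (MM) iteration. The sketch below is a generic version of
that iteration, not the authors' implementation; the function names, the
fixed iteration count, and the toy win counts are assumptions. Teams
with no wins receive skill 0, and a well-defined fit requires that the
teams be sufficiently connected through games.

```python
def fit_bradley_terry(wins, n_iter=100):
    """Fit Bradley-Terry skills by MM updates.

    wins[(a, b)] is the number of games in which team a beat team b.
    Returns a dict of skills normalized to sum to 1.
    """
    teams = sorted({team for pair in wins for team in pair})
    if not teams:
        raise ValueError("need at least one game")
    skill = {team: 1.0 / len(teams) for team in teams}
    for _ in range(n_iter):
        updated = {}
        for a in teams:
            won = sum(wins.get((a, b), 0) for b in teams if b != a)
            denom = 0.0
            for b in teams:
                games = wins.get((a, b), 0) + wins.get((b, a), 0)
                if b != a and games > 0:
                    denom += games / (skill[a] + skill[b])
            updated[a] = won / denom if denom > 0 else 0.0
        norm = sum(updated.values())
        if norm == 0:
            raise ValueError("no recorded wins")
        skill = {team: value / norm for team, value in updated.items()}
    return skill


def win_prob(skill, a, b):
    """Bradley-Terry probability that team a beats team b."""
    return skill[a] / (skill[a] + skill[b])


# Toy record (hypothetical): A beat B three times, B beat A once.
skills = fit_bradley_terry({("A", "B"): 3, ("B", "A"): 1})
print(skills, win_prob(skills, "A", "B"))
```

Because the fitted skills depend only on final outcomes, the resulting
prediction for a given pairing is the same at every point in the game,
which matches the limitation described above.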

For this task, most of our skill-based models make very
similar predictions and the first order Markov model also
performs well. Although the distributions of final lead sizes
may be different, the means are very close, and the individual
predictions across models correlate strongly. The greatest
difference occurs at the start of the game. In particular, the
first order Markov model performs much worse than the
skill-based models at the beginning because it has no
information about the heterogeneity of team scoring abilities.
As the game progresses the predictions tend to converge. This
occurs because these models all make predictions based on
random walks on a binary sequence {r,b}, the difference being
in how they model the transition probabilities. Later in the
game we extrapolate less and so the differences between models
become less pronounced.

FIG. 6. AUC scores for predicting which team will win given
the current state of the game. (Panels: CFB, NBA, NFL, NHL.)


VIII. TEAM SKILL EVOLVES OVER TIME 

A useful feature of our probabilistic models is the
interpretability of their parameters, which are meaningful
measures of team skill here. By learning these parameters
independently for each season in each sport, we can
investigate how team skills have evolved over time.

Using the best overall model for each sport, we learn its
parameters using all data in each particular season and
calculate the Spearman rank correlation across team skills for
each pair of seasons (Fig. 7). We find that the relative
ordering of teams by their inferred skills exhibits strong
serial correlation over time, which appears as a strong
diagonal component in the pairwise correlation matrices. The
low or inverse correlation in the far off-diagonal elements,
as well as the block-like patterns observed in CFB and NFL,
implies an underlying non-stationarity in team skills for each
of the leagues over the roughly 10-year span of data.
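
The pairwise season comparison can be sketched as below. This is a
minimal pure-Python illustration in which the function names, the
restriction to teams present in both seasons, and the skill values are
all hypothetical; Spearman correlation is computed in the usual way as
the Pearson correlation of average ranks.

```python
from math import sqrt


def average_ranks(values):
    """1-based ranks of values, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2 + 1  # mean of sorted positions i..j, 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation undefined for constant input")
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / sqrt(sxx * syy)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def season_correlations(skills_by_season):
    """Spearman correlation of team skills for every pair of seasons."""
    seasons = sorted(skills_by_season)
    out = {}
    for s in seasons:
        for t in seasons:
            common = sorted(set(skills_by_season[s]) & set(skills_by_season[t]))
            out[(s, t)] = spearman(
                [skills_by_season[s][team] for team in common],
                [skills_by_season[t][team] for team in common],
            )
    return out


# Hypothetical skills for three teams in three seasons.
toy_skills = {
    2007: {"A": 0.9, "B": 0.5, "C": 0.1},
    2008: {"A": 0.8, "B": 0.6, "C": 0.2},
    2009: {"A": 0.2, "B": 0.5, "C": 0.7},
}
corr = season_correlations(toy_skills)
print(corr[(2007, 2008)], corr[(2007, 2009)])
```

Arranging the resulting values by season pair gives a matrix of the kind
described above, with values near 1 for seasons whose skill orderings
agree.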

The manner in which team rosters change over time is a likely
source of such long-term dynamics in relative team skill. At
short time scales, team rosters are fairly stable, with only a
few players changing from season to season. However, over
longer time scales, these changes accumulate, and rosters
separated in time by more than a few years are likely to be
very different, with concomitant differences in team skill.

FIG. 7. Correlation of inferred skills over years for each
sport. We see that the highest correlations occur along the
block diagonal, indicating that adjacent years are more
similar. Note that the scale is different for CFB due to a
much higher correlation across all years.

The exception to this pattern is CFB, which shows a larger
long-term correlation, i.e., a slower rate of change in
relative team skills, than in professional sports. We
speculate that this difference is caused by the difference in
player mobility between college and professional-level sports:
professional teams operate in a national player market, and
players can move relatively freely among teams, while colleges
operate as rough regional monopolies over the sources of their
players.

The inferred season-by-season skill orderings themselves are
also of interest, as they reveal the particular trajectories
of individual teams over time. We show visualizations of these
trajectories for NBA and NFL in Figures 8 and 9. We omit CFB
because there are too many teams (461) to meaningfully
visualize and NHL for space reasons.

For each plot we highlight the two teams that won the 
league championship (NFL Super Bowl and NBA Finals) 
more than once during the period covered by the dataset. 
It is notable that these teams are not necessarily the most 
skilled teams under our model. This is unsurprising, as 
tournaments by bracket are the highest variance method 
of identifying the most skilled team [3]. Interestingly, in 
both NFL and NBA games, the highlighted teams tend 
to have strong offensive skills, while their defensive skills 
are more variable. This pattern suggests that offensive 
skills are more important for winning games, which seems 
reasonable given that a strong defense alone cannot win 
a game. 

Looking at individual teams, we can see how their skills
change with respect to the total ordering. For instance, the
Cleveland Cavaliers drafted LeBron James in 2003 and went from
being ranked the third worst (offensive) team to a mid-range
one. When James left for the Miami Heat at the start of the
2010-11 season, we see the Cavaliers’ offensive skill drop to
the bottom rank, while the Miami Heat’s offensive and
defensive skills increased to be ranked third and first
respectively. We also see that the Los Angeles Lakers’ skills
(both offensive and defensive) drop for the 2004-05 season.
After facing a difficult 2003-04 season, they disbanded the
team, lost their coach and faced a number of injuries,
resulting in a poorer performance in 2004-05.

FIG. 8. NBA defensive (top) and offensive (bottom) skill
rankings. Teams that won more than one NBA Finals in the data
are highlighted, i.e., Lakers (orange) 2002, 2009 and 2010,
Spurs (black) 2003, 2005 and 2007.

FIG. 9. NFL defensive (top) and offensive (bottom) skill
rankings. Teams that won more than one NFL Super Bowl in the
data are highlighted, i.e., Patriots (black) 2002, 2004 and
2005, Steelers (orange) 2006 and 2009.

Finally, the range of values occupied by offensive and
defensive skills is different between NFL and NBA teams: in
the latter, these two skills occupy non-overlapping ranges
(large and small respectively), while in the former, they fall
in similar ranges. That is, NBA teams are less likely to score
when playing defensively than when they have possession of the
ball, which serves to create a stronger anti-persistence
scoring pattern (0.36) than for NFL (0.44), where skills are
more evenly matched.

IX. CONCLUSION 

In this work we considered the online prediction tasks of Who
will score next? and Who will win? based on the sequence of
scoring events in the game so far. Our probabilistic models
based on latent team skills perform well at both predictive
tasks and can predict with a high degree of certainty (> 80%)
who will win a game in each of the four leagues we studied,
after only half of the game has elapsed. Furthermore, by using
gameplay, i.e., the particular sequence of events within each
game, to model team skill rather than just game outcomes, we
can infer different types of latent team skills, e.g.,
offensive vs. defensive skills.

Our statistical models provide a quantitative and principled
means of capturing and testing hypotheses about the
variability induced by chance, the biases produced by real
differences in team skill, and the structural impact of
game-specific rules. In applying these models to comprehensive
data from four different sports, we found that each of the
leagues is best fit by a different model. This indicates that
skill, luck, strategy, and the rules of the game serve
different roles in the scoring dynamics across sports. The
exception was professional hockey (NHL), where the very low
scoring rate resulted in no clear predictive winner among our
models.

Our models and results open up many new directions for future
work. For instance, we could incorporate other data such as
player or ball positioning [3 SO], or the timing of events
[27], to improve our models and allow us to apply them to low
scoring sports such as soccer. These models could also be used
to make other predictions, e.g., the number of scoring events
in a game, the final score, and when a lead is safe [8], and
to produce more rigorous team rankings (Figs. 8 and 9).

In addition to data on gameplay, data on individual player
attributes and performance in competitive settings are also
often available, e.g., height, strength, speed, accuracy when
scoring, defensive skill, passing skill, etc. However, there
are no good models that connect these characteristics to team
skills and to gameplay as a means of predicting game outcomes.
The models we formulated here solve part of this problem by
connecting team skill to gameplay. An interesting direction
for future work would be to predict outcomes from player
statistics via team skills as an intermediary. Such a model
would allow teams to make more data-driven choices about how
they build team rosters and train their players. This
extension of our work would also open up new avenues in
designing realistic simulations of competitive play, e.g., for
better AI in video games.